Lung cancer is a highly prevalent and deadly form of cancer that is responsible for a significant number of deaths worldwide. Lung cancer can be classified into:
The majority of lung cancer cases fall under the category of non-small-cell lung cancer (NSCLC), which accounts for approximately 85% of new lung cancer diagnoses. NSCLC can be further classified into three main categories:
Image 1: Classification of lung cancer
Smoking remains the primary risk factor for NSCLC, however, other factors such as radon exposure and air pollution also contribute to its development (Gridelli et al., 2015).
Treatment options for NSCLC comprehend surgery, radiochemotherapy, immunotherapy, and targeted therapies. Targeted therapies specifically target genetic mutations found in tumors, such as those in the epidermal growth factor receptor (EGFR) and anaplastic lymphoma kinase (ALK). The epidermal growth factor receptor (EGFR) is a cell surface receptor that plays a crucial role in regulating various cellular processes, including cell division, cell migration, and cell survival. Upon binding to its ligands (epidermal growth factors), EGFR triggers a signaling cascade inside the cell, activating multiple pathways that influence cell behavior. EGFR is located on chromosome 7 and it is a member of the ErbB family of cell surface tyrosine kinases, with a molecular weight of 170 kDa. Structurally, the EGFR is a transmembrane receptor that consists of three parts: an extracellular domain responsible for ligand binding, a transmembrane domain, and an intracellular domain with a tyrosine kinase activity. Activation of the EGFR occurs when a ligand, such as epidermal growth factor binds to the extracellular portion of the receptor. This ligand binding leads to receptor dimerization, leading to the binding with other related receptors. Without ligand binding and subsequent dimerization, the intracellular tyrosine kinase domain of EGFR remains inactive (Prabhakar, 2015).
EGFR overexpression is present in 60% of NSCLC and for this reason EGFR mutation is well-known as the main oncogenic driven mutation in some NSCLC. There are two main types of EGFR mutations that occur in non-small-cell lung cancer (NSCLC). These mutations include L858R, which involves point mutations in exon 21 (resulting in a substitution of leucine with arginine on codon 858), and exon 19 deletions, which are in-frame deletions occurring in exon 19. These two mutations account for approximately 90% of all EGFR mutations observed in NSCLC (Hsu et al., 2019).
Drugs known as EGFR inhibitors, such as tyrosine kinase inhibitors (TKIs), are used in the treatment of cancers with EGFR mutations. These inhibitors block the activity of the EGFR protein, reducing the signaling that drives cancer cell growth. EGFR inhibitors, like Osimertinib, gefitinib, and erlotinib, have shown efficacy in cases of NSCLC with L858R mutation or exon 19 deletions. In particular, Gefitinib and erlotinib are categorized as 1st-generation EGFR-TKIs, while afatinib and dacomitinib are considered 2nd-generation, and osimertinib belongs to the 3rd-generation. The 1st-generation EGFR-TKIs, bind reversibly to mutant EGFR but are not effective against the acquired T790M mutation. On the other hand, the 2nd-generation EGFR-TKIs, afatinib and dacomitinib, form irreversible covalent bonds with all ErbB receptors (EGFR, ErbB2, ErbB4, and ErbB heterodimers), but again, they do not work against the T790M mutation. Osimertinib, the 3rd-generation EGFR-TKI, on the other hand, exhibits irreversible covalent binding to mutant EGFR and is active against the T790M mutation.
Image 2: EGFR pathway
The T790M mutation is a specific mutation occurring in exon 20, where methionine is substituted with threonine at amino acid position 790. This mutation, known as the “gatekeeper” mutation in the EGFR kinase domain, affects the binding of 1st and 2nd generation EGFR-TKIs to the ATP-binding pocket. The T790M mutation accounts for 60% of secondary EGFR point mutations in NSCLC patients who develop acquired resistance to 1st and 2nd generation EGFR-TKIs. Additionally, the T790M mutation can also be found de novo in EGFR-TKI-naïve NSCLC patients, most of whom do not respond to treatment with 1st and 2nd generation EGFR-TKIs (Malapelle et al., 2018).
Clinical trial evidence has shown that osimertinib is effective (with a response rate of approximately 70%) in treating NSCLC patients who have acquired resistance to 1st and 2nd generation EGFR-TKIs due to the T790M mutation. It has also been shown that osimertinib leads to significantly longer progression-free survival compared to gefitinib and erlotinib in untreated advanced NSCLC patients with EGFR mutations. Despite these promising results, it has been shown the appearance of some drug tolerant persisters cells which are not responsive to Osimertinib. These DTPs were further analyzed in the study.
The researchers profiled osimertinib drug tolerant persisters (DTPs) using RNA-seq to characterize the features of these cells and performed drug screens to identify new therapeutic opportunities. Different NSCLC EGFR mutant cell lines (PC9, NCI-H1975, HCC827, and HCC2935) were treated with 500 nM osimertinib for 21 days (osimertinib DTPs). Cells were either harvested immediately or washed 2x with PBS and then replaced with drug-free media for a further 24 hrs. In parallel, parental cell lines were grown in drug-free media for 21 days then treated with 500 nM osimertinib for 24 hrs(osimertinib acute) or DMSO (control) for 24 hrs. Experiment was done using biological triplicates. There are 60 total RNA-seq samples, however for the aim of this project I am going to focus only on a specific cell line: H1975, selecting only H1975_osi_DTP and H1975_DMSO samples, for a total of 6 samples: 3 treated and 3 controls. Fastq files of RNA-seq were aligned using HISAT2 (v2.1.0) and Counts were quantified using Salmon (v0.8.2) using Ensembl v79 annotations. tximport was used to generate gene level count data using option countsFromAbundance = “lengthScaledTPM” and a log2 transcripts per kilobase million, log2(TPM +0.01), abundance matrix. The data was stored into a “GSE193258_RNAseq_estimated_counts.tsv.gz” file.
To begin, I set the working directory which contains the dataset.
setwd("/Users/martinaterenzi/Desktop/Exam")
The first step is to perform data normalization, which is essential for visualizing bulk RNA-seq data. In fact, the purpose of normalization is to account for differences in library sizes and sequencing depths between samples, allowing for more accurate comparisons of gene expression levels. The counts corresponding to each gene can be normalized using various methods such as TPM, FPK, RPKM and CPM. In particular, I utilized the counts per million (CPM) normalization method which uses the following formula: CPM = (Raw count for a gene / Total number of reads in the sample) * Scaling factor (10^6).
file_path= "/Users/martinaterenzi/Desktop/Exam/GSE193258_RNAseq_estimated_counts.tsv"
data=read.table(file_path, header = TRUE, sep = "\t")
counts=data.frame(data[,c(1,32,33,34,41,42,43)], row.names = rownames(data))
counts_1=data.frame(data[,c(32,33,34,41,42,43)], row.names = rownames(data))
gene_names=data$gene
rownames(counts_1)=gene_names
head(counts_1)
## H1975_DMSO_1 H1975_DMSO_2 H1975_DMSO_3 H1975_osi_DTP_1 H1975_osi_DTP_2
## A1BG 788.8477841 436.0194563 624.1128152 888.7190056 851.905787
## A1CF 5.9176652 12.3258496 0.8429555 0.8579344 8.969218
## A2M 0.3681615 0.7309377 0.0000000 124.1034271 44.935687
## A2ML1 10.0186959 38.1575548 17.7310213 10.8636700 12.917684
## A3GALT2 1.0000591 0.9945132 2.9958185 2.0160282 3.036452
## A4GALT 564.3877865 340.1630090 409.8402777 668.9767513 655.724842
## H1975_osi_DTP_3
## A1BG 850.062314
## A1CF 9.416306
## A2M 21.881597
## A2ML1 18.431924
## A3GALT2 3.039603
## A4GALT 631.619819
The raw counts are stored into a file with the .tsv extension which refers to a Tab-Separated Values file. TSV is a text file format used for storing structured data. To normalize the data, the counts within the raw count must be converted into counts per million (CPM) using the edgeR package from Bioconductor. For the purpose of this analysis, I created a new dataframe “counts_1” selecting only H1975_DMSO and H1975_osi_DTP treated cells from the initial dataframe. The normalized data are stored in “H1975_table_cpm.txt”.
matrix=as.matrix(counts_1, rownames.force=NA)
library(edgeR)
## Loading required package: limma
matrix_cpm = edgeR::cpm(matrix)
head(matrix_cpm)
## H1975_DMSO_1 H1975_DMSO_2 H1975_DMSO_3 H1975_osi_DTP_1 H1975_osi_DTP_2
## A1BG 16.595225463 10.28022483 14.54158901 20.17025642 17.3383875
## A1CF 0.124491683 0.29061204 0.01964054 0.01947157 0.1825457
## A2M 0.007745124 0.01723364 0.00000000 2.81663600 0.9145522
## A2ML1 0.210766286 0.89965766 0.41312599 0.24656051 0.2629068
## A3GALT2 0.021038540 0.02344808 0.06980142 0.04575553 0.0617993
## A4GALT 11.873193732 8.02017470 9.54912114 15.18301345 13.3456206
## H1975_osi_DTP_3
## A1BG 20.58156499
## A1CF 0.22798601
## A2M 0.52979352
## A2ML1 0.44627064
## A3GALT2 0.07359435
## A4GALT 15.29267223
write.table(matrix_cpm, file = "H1975_table_cpm.txt", sep ="\t", col.names=NA)
After normalizing the data, the next step is to visualize it through Principal Component Analysis. The main goal of PCA is to transform a high-dimensional dataset into a lower-dimensional space while preserving the most important information or patterns present in the data. It achieves this by finding new orthogonal axes, called principal components, that capture the maximum variance in the data. By reducing the data into a linear space, PCA facilitates its visualization. In the analyzed dataset (6 conditions), it is not possible to visualize the data in 6 dimensions, so 6 PCs were determined through PCA. This was achieved with the command prcomp applied to the normalized counts.
normalized_counts = read.table("H1975_table_cpm.txt", sep="\t", header = TRUE, row.names=1)
normalized_counts=log2(normalized_counts+1)
pca=prcomp(t(normalized_counts))
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 50.3945 16.3950 11.98016 10.32848 8.41077 1.372e-13
## Proportion of Variance 0.8115 0.0859 0.04586 0.03409 0.02261 0.000e+00
## Cumulative Proportion 0.8115 0.8974 0.94330 0.97739 1.00000 1.000e+00
Printing the summary of the PCA gives back the following information: standard deviation, the proportion of variance and the cumulative proportion for each element. Once the dimensional reduction Is performed, it is possible to represent the data into a screeplot. The scree plot is a graphical representation of the eigenvalues or the amount of variance explained by each principal component in a Principal Component Analysis (PCA) which indicates the proportion of total variance explained by that component. The scree plot has two main functions in PCA:
Visualization: The scree plot helps visualize the amount of variance explained by each principal component. It displays the eigenvalues on the y-axis (percent variation), which represents the proportion of variance explained, and the corresponding principal component numbers on the x-axis.
Determine the number of principal components: The scree plot helps determining the number of principal components to retain for further analysis.
variance= pca$sdev^2
variance_perc= round(variance/sum(variance)*100, 1)
png("H1975_screeplot.png", width=6000, height=5000, res=700)
barplot(variance_perc, main="Scree Plot", xlab="Principal Component", ylab="Percent Variation", ylim = c(0,100))
dev.off()
## quartz_off_screen
## 2
Screeplot
Looking at the screeplot I retained only PC1 and PC2 for further analysis because these components capture the most significant variation in the original data. At this point, I created a 2D plot containing only PC1 and PC2 using the ggplot function. The plot was saved as “PCA_plot.pdf”.
library(ggplot2)
pca_table= data.frame(Sample=rownames(pca$x),
PC1=pca$x[,1],
PC2=pca$x[,2])
dim(pca_table)
## [1] 6 3
head(pca_table)
## Sample PC1 PC2
## H1975_DMSO_1 H1975_DMSO_1 -52.97515 18.241784
## H1975_DMSO_2 H1975_DMSO_2 -33.32743 -25.715923
## H1975_DMSO_3 H1975_DMSO_3 -49.23097 5.311382
## H1975_osi_DTP_1 H1975_osi_DTP_1 48.05740 5.096384
## H1975_osi_DTP_2 H1975_osi_DTP_2 33.20917 -13.537976
## H1975_osi_DTP_3 H1975_osi_DTP_3 54.26698 10.604349
png("PCA_plot.png", width=6000, height=5000, res=700)
ggplot(data=pca_table, aes(x=PC1, y=PC2, color=Sample)) +
geom_point() +
scale_color_manual(labels = c(rep("treated_osi",3), rep("control_DMSO",3)),
values = c(rep("violetred1",3), rep("mediumpurple3",3))) +
xlab(paste("PC1 - ", variance_perc[1], "%", sep="")) +
ylab(paste("PC2 - ", variance_perc[2], "%", sep="")) +
ggtitle("PCA samples")
dev.off()
## quartz_off_screen
## 2
From this plot is possible to determine:
Proximity of points: The proximity of data points in the PCA plot indicates their similarity or dissimilarity in terms of features. Points that are close to each other on the plot are likely to have similar patterns or characteristics, while points that are far apart are likely to have distinct patterns.
Clusters and outliers: If points of the same category or class are clustered together, it suggests that they share common characteristics or patterns.
Variance explained: Consider the amount of variance explained by each principal component. The proportion of variance explained by a component can be determined by its associated eigenvalue. Higher eigenvalues indicate more variance explained.
By looking at the the graph we can see that the controls are well separated from the treatments. The separation is mainly in PC1 which is responsible for 81.2% of the variation among samples. However, we can also observe that there is not clear clustering of the data which can suggest that the dataset itself does not contain enough information or discriminative power to reveal distinct clusters in the reduced-dimensional space.
Next, I used the loading scores (proportions of the impact of each gene on a specific PC) to determine which are the genes that contribute the most to the positioning of the samples along PC1. Since I want to understand which are the genes that have made the samples positioning on the right or on the left of the graph, I consider the absolute values of the loading scores. The scores were ranked from top to bottom and the top 10 genes with the largest loading score magnitude were selected. Finally, it is possible to determine which genes have positive or negative loading scores determining the positioning of the samples in the PCA plot (positive: right and negative: left). The results were saved in the following files: “pc1_top10genes.txt”,“pc2_top10genes.txt” and can be visualized here.
pc1_loading_scores = pca$rotation[,1]
pc1_gene_scores = abs(pc1_loading_scores)
pc1_rank_gene_score= sort(pc1_gene_scores, decreasing=TRUE)
pc1_top_10_genes= names(pc1_rank_gene_score[1:10])
pc1_top_10_genes
## [1] "ESRP1" "CDH1" "LAD1" "MAL2" "VGF" "TMEM30B" "CNN1"
## [8] "EDN2" "MYEOV" "RTL1"
pc1_top_10_genes = pca$rotation[pc1_top_10_genes,1]
write.table(pc1_top_10_genes, file = "pc1_top10genes.txt", sep ="\t", col.names=NA)
The same analysis was performed for PC2
pc2_loading_scores = pca$rotation[,2]
pc2_gene_scores = abs(pc2_loading_scores)
pc2_rank_gene_score= sort(pc2_gene_scores, decreasing=TRUE)
pc2_top_10_genes = names(pc2_rank_gene_score[1:10])
pc2_top_10_genes
## [1] "FOS" "PLCB4" "ANXA8L1" "ANXA8" "AL035078.4"
## [6] "SERPINA5" "CXCR5" "PLXNA4" "PLEKHS1" "CCN5"
pc2_top_10_genes = pca$rotation[pc2_top_10_genes,1]
write.table(pc2_top_10_genes, file = "pc2_top10genes.txt", sep ="\t", col.names=NA)
Differential expression analysis is a common technique used to identify genes or features that show significant changes in expression levels between different conditions or groups. It helps to understand how gene expression varies under different biological contexts, such as, treated vs. untreated. In particular, in this dataset I want to determine which genes are differentially expressed after the treatment of the cells with Osimertinib drug compared to the controls (DMSO).
To perform this analysis I used the DESeq2 package which is a widely used R package for differential gene expression analysis from RNA-seq count data. This package requires the raw counts. The command works with integer numbers, for this reason I first converted the raw counts into integers in the “treated” dataframe. The dataframe is then converted into a matrix which in turn is used to create a dataframe coldata which specifies the names of the samples (derived from the column names of the count matrix) and assigns a condition factor, with “ctrl” indicating control samples and “treat” indicating treatment samples. Using these conditions, I determined the object dds as the dataset on which I performed the DE analysis. dds = DESeq(dds)`: This command fits the DESeq model to the “dds” object, estimating size factors, dispersions, and fitting the negative binomial model for differential expression analysis. The results are saved in the variable res together with the gene symbols. Finally, I have filtered the results selecting the genes that present a significative p-value (equal or smaller than 0.05) and an absolute LogFoldChange (LFC) equal or greater than 1.
library(DESeq2)
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following object is masked from 'package:limma':
##
## plotMA
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
treated=counts_1
treated=as.data.frame(lapply(treated, as.integer))
head(treated)
## H1975_DMSO_1 H1975_DMSO_2 H1975_DMSO_3 H1975_osi_DTP_1 H1975_osi_DTP_2
## 1 788 436 624 888 851
## 2 5 12 0 0 8
## 3 0 0 0 124 44
## 4 10 38 17 10 12
## 5 1 0 2 2 3
## 6 564 340 409 668 655
## H1975_osi_DTP_3
## 1 850
## 2 9
## 3 21
## 4 18
## 5 3
## 6 631
matrix_treated=as.matrix(treated)
coldata=data.frame(samples=dimnames(matrix_treated)[[2]], condition=factor(c(rep("ctrl",3), rep("treat",3))))
dds=DESeqDataSetFromMatrix(countData = matrix_treated, colData = coldata, design = ~condition)
dds=DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res=results(dds)
results_df=as.data.frame(res)
results_df$gene=counts$gene
head(results_df)
## baseMean log2FoldChange lfcSE stat pvalue padj
## 1 735.660787 0.49501760 0.2094534 2.363377630 1.810921e-02 4.360523e-02
## 2 5.677788 -0.01118102 1.6634579 -0.006721551 9.946370e-01 9.975953e-01
## 3 31.190673 8.40672897 1.4061637 5.978485321 2.252219e-09 2.435139e-08
## 4 17.783363 -0.71392496 0.7695154 -0.927759115 3.535325e-01 4.921508e-01
## 5 1.827968 1.40932018 1.9393552 0.726695247 4.674127e-01 6.050703e-01
## 6 541.280267 0.58035512 0.1984846 2.923930673 3.456417e-03 1.030488e-02
## gene
## 1 A1BG
## 2 A1CF
## 3 A2M
## 4 A2ML1
## 5 A3GALT2
## 6 A4GALT
results_df=results_df[!is.na(results_df$log2FoldChange),]
results_df=results_df[which((results_df$padj<=0.05) &(abs(results_df$log2FoldChange)>=1)),]
head(results_df)
## baseMean log2FoldChange lfcSE stat pvalue padj gene
## 3 31.19067 8.406729 1.4061637 5.978485 2.252219e-09 2.435139e-08 A2M
## 14 519.24891 -1.185638 0.1758240 -6.743319 1.548084e-11 2.373160e-10 AADAT
## 17 443.60869 1.354595 0.2163098 6.262289 3.793659e-10 4.656717e-09 AAMDC
## 19 43.91813 -3.340325 0.5007026 -6.671275 2.535904e-11 3.779855e-10 AANAT
## 27 95.45950 3.027056 0.4494267 6.735371 1.635122e-11 2.504287e-10 AASS
## 30 417.24621 1.405086 0.2461975 5.707149 1.148844e-08 1.079280e-07 ABAT
dim(results_df)
## [1] 2778 7
At this point, the DE genes can be represented using a Volcano plot.
library(EnhancedVolcano)
## Loading required package: ggrepel
EnhancedVolcano(results_df, x="log2FoldChange", y="padj", lab=results_df$gene, xlab= "log2FoldChange", ylab="adj p-value")
ggsave(filename="H1975_Volcano.int.pdf")
## Saving 7 x 5 in image
A volcano plot is a common visualization tool used in differential expression analysis to visualize gene expression differences between conditions. It displays the relationship between statistical significance, adjusted p-value (y-axis) and the magnitude of fold change (x-axis) for each gene. The threshold of the p-value is set at 0.05 meaning that the genes above the threshold have a significant differential expression. The dots at the top-right corner of the volcano plot correspond to genes that show a high level of statistical significance (very small adjusted p-values) and a large fold change, indicating a substantial difference in expression levels between the conditions being compared. These dots represent genes that are significantly upregulated in our treated cells compared to the controls. Similarly, the dots at the top-left corner of the volcano plot correspond to genes that exhibit a high level of statistical significance and a large negative fold change. These dots represent genes that are significantly downregulated in the treated condition compared to controls. The green dots represent the genes which do not show a significant differential expression in the treatment condition compared to the controls. I saved the volcano plot into a “H1975_Volcano.int.pdf” file.
Using Gene Ontology, it is possible to perform functional enrichment analysis. This analysis compares a set of genes, such as genes that are differentially expressed, to a background set of all genes. This helps to understand the biological meaning and potential functions associated with those genes. Gene Ontology (GO) is a structured system that helps organize and describe the functions of genes and gene products. This system can be exploited to determine the biological processes of our set of genes. GO has three main categories:
Molecular functions: describes what specific tasks or jobs a gene performs at a molecular level.
Biological process: series of activities that genes are involved in to achieve a particular biological outcome.
Cellular component: where in the cell a gene’s activity takes place.
The terms in Gene Ontology are organized in a hierarchy, which means that specific terms are linked to more general terms. This helps to provide a comprehensive understanding of gene functions at different levels of detail. For this project I decided to focus on information about what the genes do and their involvement in different biological processes.
First, I have selected and stored the positively and genitively DE genes into two text files: “pos.DEG.int.txt” “neg.DEG.int.txt”. Then, these were converted into two dataframes. At this point, exploiting these sets of genes I performed the GO analysis on the GO website. Then the results were downloaded and converted into an excel file to have an easy access to the results. Then the results were converted into two dataframes: “data_pos” and “data_neg” respectively. For each dataframe, I selected the first 20 rows of the columns related to the biological process ontology, the enriched fold change (represents how much the indicated biological process is enriched) and the raw p-value (log transformed), which indicates how much we can be confident that the indicated biological process is truly enriched.
gene_list=results_df$gene
gene_list=as.data.frame(gene_list)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:Biobase':
##
## combine
## The following object is masked from 'package:matrixStats':
##
## count
## The following objects are masked from 'package:GenomicRanges':
##
## intersect, setdiff, union
## The following object is masked from 'package:GenomeInfoDb':
##
## intersect
## The following objects are masked from 'package:IRanges':
##
## collapse, desc, intersect, setdiff, slice, union
## The following objects are masked from 'package:S4Vectors':
##
## first, intersect, rename, setdiff, setequal, union
## The following objects are masked from 'package:BiocGenerics':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
positive_genes= dplyr::filter(results_df, log2FoldChange >0)
negative_genes=dplyr::filter(results_df, log2FoldChange <0)
pos_deg = positive_genes$gene
neg_deg = negative_genes$gene
writeLines(pos_deg, "pos.DEG.int.txt")
writeLines(neg_deg, "neg.DEG.int.txt")
pos_deg=as.data.frame(pos_deg)
neg_deg=as.data.frame(neg_deg)
head(pos_deg)
## pos_deg
## 1 A2M
## 2 AAMDC
## 3 AASS
## 4 ABAT
## 5 ABCA5
## 6 ABCB1
head(neg_deg)
## neg_deg
## 1 AADAT
## 2 AANAT
## 3 ABCA12
## 4 ABCA13
## 5 ABCC3
## 6 ABCG2
library(readxl)
data_pos = read_excel("/Users/martinaterenzi/Desktop/Exam/Pos_deg_analysis.xltx")
data_pos=as.data.frame(data_pos)
data_pos$`raw P-value`=as.numeric(data_pos$`raw P-value`)
class(data_pos$`raw P-value`)
## [1] "numeric"
data_ggp1= data.frame(data_pos$`GO biological process complete`[1:20], data_pos$`fold Enrichment`[1:20], -log(data_pos$`raw P-value`[1:20]))
colnames(data_ggp1) = c("biological_process", "fold_enrichment", "log_p_value")
dim(data_ggp1)
## [1] 20 3
Then I plotted the data. On the x-asis is represented the fold enrichment and on the y-axis are represented the first 20 biological processes. The fold enrichment is used to assess the enrichment of a particular set of genes or features within a larger reference set. It helps determine whether certain genes are more abundant or significantly represented in a specific category or condition compared to what would be expected by chance. Furthermore, as we can see the color of the bars is correlated to the log_p_value. A darker color means a lower p-value meaning that the enrichment observed is not due to random events, on the other hand, a lighter color, so a higher p-value means that it is more probable that the enrichment observed is due to chance.
library(ggplot2)
ggplot(data_ggp1) +
geom_bar(stat = "identity",
aes(y=fold_enrichment,x=biological_process, fill=log_p_value)) +
labs(fill="log_p_value") +
theme(legend.position =c("bottom"),
legend.box="horizontal",
legend.text = element_text(size = 9),
legend.key.height = unit(0.75, "cm"),
legend.key.width = unit(0.75, "cm"),
axis.text=element_text(size=7)) +
scale_x_discrete() +
labs(title="GO positive",
y="fold enrichment",
x="Biological process") +
coord_flip()
ggsave(filename="GOpos.pdf")
## Saving 7 x 5 in image
data_neg= read_excel("/Users/martinaterenzi/Desktop/Exam/Neg_deg_analysis.xlt")
data_neg=as.data.frame(data_neg)
data_neg$`raw P-value`=as.numeric(data_neg$`raw P-value`)
class(data_neg$`raw P-value`)
## [1] "numeric"
data_ggp2= data.frame(data_neg$`GO biological process complete`[1:20], data_neg$`fold Enrichment`[1:20], -log(data_neg$`raw P-value`[1:20]))
colnames(data_ggp2) = c("biological_process", "fold_enrichment", "log_p_value")
dim(data_ggp2)
## [1] 20 3
ggplot(data_ggp2) +
geom_bar(stat = "identity",
aes(y=fold_enrichment,x=biological_process, fill=log_p_value)) +
labs(fill="log_p_value") +
theme(legend.position =c("bottom"),
legend.box="horizontal",
legend.text = element_text(size = 9),
legend.key.height = unit(0.75, "cm"),
legend.key.width = unit(0.75, "cm"),
axis.text=element_text(size=7)) +
scale_x_discrete() +
labs(title="GO negative",
y="fold enrichment",
x="Biological process") +
coord_flip()
ggsave(filename="GOneg.pdf")
## Saving 7 x 5 in image
Positively DE genes: processes involved with kidneys like renal sodium ion transport, cell proliferation involved in kidney developmenand renal absorption. Cell adhesion mechanisms and collagen related processes. positive regulation of smooth muscle cell migrate and biological processes involved with hormone processing. positive regultion of heart contraction and regulation of systemic arterial blood pressure by hormone.
Negatively DE genes: mitotic spindle processes, DNA involved processes like regulation, repair and replication, regulation of chromosome condensation and kinetochore organization/assembly.
For this last part of the project, I am going to take into consideration another dataset: GSE193257. Also in this case H1975 EGFR mutant NSCLC cells were treated with 500 nM of osimertinib for 24 days (osimertinib DTPs) or treated with 500 nM of osimertinib for 21 days followed by 150 nM of AZD5153 for 3 days (osimertinib DTP AZD5153 sequential combination). In parallel, H1975 cells were grown in drug-free media for 21 days then treated with DMSO control for 3 days. The experiment was done using biological triplicates. For the aim of the project, I only selected samples with the same conditions of the previous dataset: 3 osimertinib DTPs samples and 3 DMSO samples (controls).
They profiled osimertinib DTPs using ChIP-seq, to characterize the features of these cells and performed drug screens to identify therapeutic opportunities. Fastq files were aligned using bwa mem (version 0.7.17) Bams were deduplicated using biobambam v2.0.87 (bamsormadup) and narrowPeaks were called using MACS2 (v2.2.6) Consensus peaks were geneted by opening a 500bp windown around each peak summit and a set of non-overlapping regions was determined using the maximal scoring peaks using BEDOPS v2.4. Peaks were annotated ot the nearest gene using ChIPseeker (v1.29.1)
ChIP-seq with H3K27ac allows to identify and map genomic regions enriched with this specific histone acetylation mark. H3K27ac refers to the acetylation of lysine 27 on the histone H3 protein and is generally associated with active gene expression and is often found at enhancer regions in the genome. For this reason, I expect to find an overlap between the ChIP seq peaks and the genomic regions of the DE genes which are upregulated. To test this hypothesis, I aligned the genomic regions of the upregulated DE genes with the genomic regions of the beginning and ending of the peaks and did the same for the downregukated DE genes.
First, I set the pathway for the narrowpeak file. A narrowpeak file is a standard file format used to represent genomic regions or peaks identified from various high-throughput sequencing experiments, such as ChIP-seq. It provides information about the genomic coordinates, signal intensity, and other relevant annotations for each identified peak. The narrowpeak file was stored into “peaks_osim_1”.
Narrow_file_Osim_1= "/Users/martinaterenzi/Desktop/Exam/Osim-1_H3K27Ac_hs_i53_peaks.narrowPeak"
peaks_osim_1= read.delim(Narrow_file_Osim_1, header = FALSE, stringsAsFactors = FALSE)
head(peaks_osim_1)
## V1 V2 V3 V4 V5 V6
## 1 chr1 903600 903817 X13_07BS_00YDAZ_H1975-Osim-1_H3K27Ac_hs_i53_peak_1 23 .
## 2 chr1 905094 905816 X13_07BS_00YDAZ_H1975-Osim-1_H3K27Ac_hs_i53_peak_2 120 .
## 3 chr1 940469 940830 X13_07BS_00YDAZ_H1975-Osim-1_H3K27Ac_hs_i53_peak_3 191 .
## 4 chr1 941027 941521 X13_07BS_00YDAZ_H1975-Osim-1_H3K27Ac_hs_i53_peak_4 91 .
## 5 chr1 958548 959271 X13_07BS_00YDAZ_H1975-Osim-1_H3K27Ac_hs_i53_peak_5 227 .
## 6 chr1 964958 965171 X13_07BS_00YDAZ_H1975-Osim-1_H3K27Ac_hs_i53_peak_6 25 .
## V7 V8 V9 V10
## 1 3.53702 4.15945 2.38870 152
## 2 7.27846 14.21410 12.08320 177
## 3 8.17650 21.41390 19.12180 183
## 4 4.69125 11.25050 9.19753 173
## 5 11.18970 25.11720 22.75580 521
## 6 3.31341 4.34612 2.56258 28
library(readxl)
colnames(neg_deg)[colnames(neg_deg)=="neg_deg"]="SYMBOL"
colnames(pos_deg)[colnames(pos_deg)=="pos_deg"]="SYMBOL"
Then, in order to find the genomic regions of the upregulated DE
genes I used the BiomaRt library which is used to access biological data
from Ensembl. UseMart is a function in the biomaRt package
that initializes a connection to the BioMart database allowing to
specify the database you want to access. In this case, I am using the
“ensembl” database. The overall purpose of this command is to establish
a connection to the Ensembl database and select the dataset containing
gene annotations for the human species. This allows to retrieve genomic
coordinates. The gene coordinates were stored together with the gene
symbols into a new dataframe “location_pos_genes”. The same process was
applied to the negatively regulated DE genes and the results were stored
in “location_neg_genes”.
library(biomaRt)
## Warning: package 'biomaRt' was built under R version 4.2.3
ensembl= useMart("ensembl", dataset = "hsapiens_gene_ensembl")
pos_gene_info = getBM(attributes = c("external_gene_name", "chromosome_name", "start_position", "end_position"),
filters = "external_gene_name",
values = pos_deg$SYMBOL,
mart = ensembl)
location_pos_genes= merge(pos_deg, pos_gene_info, by.x = "SYMBOL", by.y = "external_gene_name")
head(location_pos_genes)
## SYMBOL chromosome_name start_position end_position
## 1 A2M 12 9067664 9116229
## 2 AAMDC 11 77821109 77918432
## 3 AASS 7 122064583 122144308
## 4 ABAT 16 8674596 8784575
## 5 ABCA5 17 69244311 69327244
## 6 ABCB1 7 87503017 87713323
write.table(location_pos_genes, file = "location_pos_genes.txt", sep = "\t", row.names = FALSE)
ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
neg_gene_info = getBM(attributes = c("external_gene_name", "chromosome_name", "start_position", "end_position"),
filters = "external_gene_name",
values = neg_deg$SYMBOL,
mart = ensembl)
location_neg_genes= merge(neg_deg, neg_gene_info, by.x = "SYMBOL", by.y = "external_gene_name")
head(location_neg_genes)
## SYMBOL chromosome_name start_position end_position
## 1 AADAT 4 170060222 170091699
## 2 AANAT 17 76453351 76470117
## 3 ABCA12 2 214931542 215138626
## 4 ABCA13 7 48171458 48647497
## 5 ABCC3 17 50634777 50692253
## 6 ABCG2 4 88090150 88231628
write.table(location_neg_genes, file = "location_neg_genes.txt", sep = "\t", row.names = FALSE)
location_pos_genes = read.table("location_pos_genes.txt", header = TRUE, sep = "\t")
location_neg_genes = read.table("location_neg_genes.txt", header = TRUE, sep = "\t")
Then I created a dataframe “peaks_osim_1_regions” containing only the information that I need for the alignment.
peaks_osim_1_regions=peaks_osim_1[,c("V1", "V2", "V3", "V4")]
colnames(peaks_osim_1_regions) = c("seqname", "peakStart", "peakEnd", "peakID")
At this point I installed the GenomicRanges library, an R package, specifically designed for working with genomic intervals coordinates and with the command FindOverlaps, the overlaps between the positively DE genes and the peaks coordinates were stored into “overlapping_genes_1_pos”. Then I extracted the indices of the genes using the command subjectHIts. I retrieved the gene names associated with the overlapping genes and I converted them into a dataframe “overlapping_gene_names_osim_1_pos” which contains a single column with the gene names. These commands allow to further analyze the overlapping genes in subsequent steps of the analysis. The same process was repeated for the negatively DE genes.
library(GenomicRanges)
gene_ranges_pos= GRanges(
seqnames = rep("ArtificialChromosome", length(location_pos_genes$start_position)),
ranges = IRanges(location_pos_genes$start_position, location_pos_genes$end_position)
)
peak_osim_1_ranges = GRanges(
seqnames = rep("ArtificialChromosome", length(peaks_osim_1_regions$peakStart)),
ranges = IRanges(peaks_osim_1_regions$peakStart, peaks_osim_1_regions$peakEnd)
)
overlapping_genes_1_pos = findOverlaps(gene_ranges_pos, peak_osim_1_ranges)
overlapping_indexes_1_pos = subjectHits(overlapping_genes_1_pos)
overlapping_gene_names_osim_1_pos = location_pos_genes$SYMBOL[overlapping_indexes_1_pos]
overlapping_gene_names_osim_1_pos=as.data.frame(overlapping_gene_names_osim_1_pos)
dim(overlapping_gene_names_osim_1_pos)
## [1] 55667 1
overlapping_gene_names_osim_1_pos= subset(overlapping_gene_names_osim_1_pos, overlapping_gene_names_osim_1_pos !="NA")
dim(overlapping_gene_names_osim_1_pos)
## [1] 1730 1
head(overlapping_gene_names_osim_1_pos)
## overlapping_gene_names_osim_1_pos
## 1 CPAMD8
## 2 CPE
## 3 CPNE4
## 4 CPQ
## 5 CPS1
## 6 CPZ
gene_ranges_neg<- GRanges(
seqnames = rep("ArtificialChromosome", length(location_neg_genes$start_position)),
ranges = IRanges(location_neg_genes$start_position, location_neg_genes$end_position)
)
peak_osim_1_ranges <- GRanges(
seqnames = rep("ArtificialChromosome", length(peaks_osim_1_regions$peakStart)),
ranges = IRanges(peaks_osim_1_regions$peakStart, peaks_osim_1_regions$peakEnd)
)
overlapping_genes_1_neg = findOverlaps(gene_ranges_neg, peak_osim_1_ranges)
overlapping_indexes_1_neg = subjectHits(overlapping_genes_1_neg)
overlapping_gene_names_osim_1_neg = location_neg_genes$SYMBOL[overlapping_indexes_1_neg]
overlapping_gene_names_osim_1_neg=as.data.frame(overlapping_gene_names_osim_1_neg)
dim(overlapping_gene_names_osim_1_neg)
## [1] 32665 1
overlapping_gene_names_osim_1_neg = subset(overlapping_gene_names_osim_1_neg, overlapping_gene_names_osim_1_neg !="NA")
dim(overlapping_gene_names_osim_1_neg)
## [1] 742 1
head(overlapping_gene_names_osim_1_neg)
## overlapping_gene_names_osim_1_neg
## 400 MAP7
## 401 MAPK13
## 402 MAPK8IP2
## 403 MARCHF4
## 404 MARVELD3
## 405 MASTL
overlap_positive_1= 1730
overlap_negative_1 = 742
total_genes_DE = 2778
prop_overlap_pos_1 = overlap_positive_1 / total_genes_DE
prop_overlap_neg_1 = overlap_negative_1 / total_genes_DE
proportions_1 = data.frame(
Category = c("Positive_DE_genes", "Negative_DE_genes"),
Proportion = c(prop_overlap_pos_1, prop_overlap_neg_1)
)
png("barplot1.png", width = 3000, height = 2000, res = 400)
barplot(proportions_1$Proportion, names.arg = proportions_1$Category,
xlab = "Category", ylab = "Proportion", main = "Overlapping with ChIP peaks Osim 1",
col = c("palevioletred", "olivedrab"), ylim = c(0, max(proportions_1$Proportion) * 1.2))
dev.off()
## quartz_off_screen
## 2
At this point I represented the
number of overlaps of the positively and negatively DE genes with the
peaks using a barplot, which was saved as “barplot1.png”. As we can see,
the number of overlaps between the positively regulated genes and the
peaks is higher (1759) than the number of overlaps between the
negatively DE genes and the peaks (710). This could mean that there is a
correlation between the H3K27ac and increased gene expression.
The same analysis was performed for the other two treated samples obtaining similar results
Narrow_file_Osim_2 <- "/Users/martinaterenzi/Desktop/Exam/Osim-2_H3K27Ac_hs_i54_peaks.narrowPeak"
# Read the file into a data frame
peaks_osim_2 <- read.delim(Narrow_file_Osim_2, header = FALSE, stringsAsFactors = FALSE)
head(peaks_osim_2)
## V1 V2 V3 V4 V5 V6
## 1 chr1 817110 817382 X14_07BT_00YDAZ_H1975-Osim-2_H3K27Ac_hs_i54_peak_1 54 .
## 2 chr1 904010 905478 X14_07BT_00YDAZ_H1975-Osim-2_H3K27Ac_hs_i54_peak_2 287 .
## 3 chr1 906392 907062 X14_07BT_00YDAZ_H1975-Osim-2_H3K27Ac_hs_i54_peak_3 24 .
## 4 chr1 920598 920881 X14_07BT_00YDAZ_H1975-Osim-2_H3K27Ac_hs_i54_peak_4 63 .
## 5 chr1 923479 923761 X14_07BT_00YDAZ_H1975-Osim-2_H3K27Ac_hs_i54_peak_5 54 .
## 6 chr1 924058 925634 X14_07BT_00YDAZ_H1975-Osim-2_H3K27Ac_hs_i54_peak_6 777 .
## V7 V8 V9 V10
## 1 4.91365 7.34785 5.48118 198
## 2 13.10310 30.93860 28.73920 460
## 3 2.91276 4.16465 2.40386 614
## 4 3.66916 8.24548 6.35513 123
## 5 4.54859 7.33206 5.46603 234
## 6 23.06780 80.26830 77.75610 744
library(readxl)
peaks_osim_2_regions=peaks_osim_2[,c("V1", "V2", "V3", "V4")]
colnames(peaks_osim_2_regions) = c("seqname", "peakStart", "peakEnd", "peakID")
gene_ranges_pos<- GRanges(
seqnames = rep("ArtificialChromosome", length(location_pos_genes$start_position)),
ranges = IRanges(location_pos_genes$start_position, location_pos_genes$end_position)
)
peak_osim_2_ranges <- GRanges(
seqnames = rep("ArtificialChromosome", length(peaks_osim_2_regions$peakStart)),
ranges = IRanges(peaks_osim_2_regions$peakStart, peaks_osim_2_regions$peakEnd)
)
overlapping_genes_2_pos =findOverlaps(gene_ranges_pos, peak_osim_2_ranges)
overlapping_indexes_2_pos = subjectHits(overlapping_genes_2_pos)
overlapping_gene_names_osim_2_pos = location_pos_genes$SYMBOL[overlapping_indexes_2_pos]
overlapping_gene_names_osim_2_pos=as.data.frame(overlapping_gene_names_osim_2_pos)
dim(overlapping_gene_names_osim_2_pos)
## [1] 47100 1
overlapping_gene_names_osim_2_pos= subset(overlapping_gene_names_osim_2_pos, overlapping_gene_names_osim_2_pos !="NA")
dim(overlapping_gene_names_osim_2_pos)
## [1] 1735 1
head(overlapping_gene_names_osim_2_pos)
## overlapping_gene_names_osim_2_pos
## 1 COL4A3
## 2 COL4A4
## 3 COL4A5
## 79 CNN1
## 177 LOXL2
## 178 LOXL4
gene_ranges_neg= GRanges(
seqnames = rep("ArtificialChromosome", length(location_neg_genes$start_position)),
ranges = IRanges(location_neg_genes$start_position, location_neg_genes$end_position)
)
peak_osim_2_ranges = GRanges(
seqnames = rep("ArtificialChromosome", length(peaks_osim_2_regions$peakStart)),
ranges = IRanges(peaks_osim_2_regions$peakStart, peaks_osim_2_regions$peakEnd)
)
overlapping_genes_2_neg = findOverlaps(gene_ranges_neg, peak_osim_2_ranges)
overlapping_indexes_2_neg = subjectHits(overlapping_genes_2_neg)
overlapping_gene_names_osim_2_neg = location_neg_genes$SYMBOL[overlapping_indexes_2_neg]
overlapping_gene_names_osim_2_neg=as.data.frame(overlapping_gene_names_osim_2_neg)
dim(overlapping_gene_names_osim_2_neg)
## [1] 28029 1
overlapping_gene_names_osim_2_neg= subset(overlapping_gene_names_osim_2_neg, overlapping_gene_names_osim_2_neg !="NA")
dim(overlapping_gene_names_osim_2_neg)
## [1] 763 1
head(overlapping_gene_names_osim_2_neg)
## overlapping_gene_names_osim_2_neg
## 359 LYAR
## 360 LYPD3
## 361 LYPD5
## 362 MAB21L4
## 363 MACC1
## 364 MAD2L1
overlap_positive_2= 1735
overlap_negative_2 = 763
total_genes_DE = 2778
prop_overlap_pos_2 = overlap_positive_2 / total_genes_DE
prop_overlap_neg_2 = overlap_negative_2 / total_genes_DE
proportions_2 = data.frame(
Category = c("Positive_DE_genes", "Negative_DE_genes"),
Proportion = c(prop_overlap_pos_2, prop_overlap_neg_2)
)
Also in this case the number of overlaps between the positively regulated genes and the peaks is higher (1740) than the number of overlaps between the negatively DE genes and the peaks (742).
proportions_2 = data.frame(
Category = c("Positive_DE_genes", "Negative_DE_genes"),
Proportion = c(prop_overlap_pos_2, prop_overlap_neg_2)
)
png("barplot2.png", width = 3000, height = 2000, res = 400)
barplot(proportions_2$Proportion, names.arg = proportions_2$Category,
xlab = "Category", ylab = "Proportion", main = "Overlapping with ChIP peaks Osim 2",
col = c("orange", "skyblue"), ylim = c(0, max(proportions_2$Proportion) * 1.2))
dev.off()
## quartz_off_screen
## 2
Overlapping with ChIP Sample 2
Narrow_file_Osim_3 = "/Users/martinaterenzi/Desktop/Exam/Osim-3_H3K27Ac_hs_i60_peaks.narrowPeak"
peaks_osim_3 = read.delim(Narrow_file_Osim_3, header = FALSE, stringsAsFactors = FALSE)
head(peaks_osim_3)
## V1 V2 V3 V4 V5 V6
## 1 chr1 817183 817389 X15_07C0_00YDAZ_H1875-Osim-3_H3K27Ac_hs_i60_peak_1 67 .
## 2 chr1 905119 905791 X15_07C0_00YDAZ_H1875-Osim-3_H3K27Ac_hs_i60_peak_2 58 .
## 3 chr1 920699 920905 X15_07C0_00YDAZ_H1875-Osim-3_H3K27Ac_hs_i60_peak_3 25 .
## 4 chr1 923412 923766 X15_07C0_00YDAZ_H1875-Osim-3_H3K27Ac_hs_i60_peak_4 18 .
## 5 chr1 924684 924987 X15_07C0_00YDAZ_H1875-Osim-3_H3K27Ac_hs_i60_peak_5 73 .
## 6 chr1 932456 932670 X15_07C0_00YDAZ_H1875-Osim-3_H3K27Ac_hs_i60_peak_6 31 .
## V7 V8 V9 V10
## 1 5.46027 8.68664 6.76320 130
## 2 5.09759 7.74435 5.85295 186
## 3 2.57930 4.24476 2.51557 96
## 4 2.85564 3.53412 1.85207 59
## 5 4.75065 9.25272 7.30868 146
## 6 3.73286 4.96282 3.19532 192
library(readxl)
peaks_osim_3_regions=peaks_osim_3[,c("V1", "V2", "V3", "V4")]
colnames(peaks_osim_3_regions)= c("seqname", "peakStart", "peakEnd", "peakID")
gene_ranges_pos= GRanges(
seqnames = rep("ArtificialChromosome", length(location_pos_genes$start_position)),
ranges = IRanges(location_pos_genes$start_position, location_pos_genes$end_position)
)
peak_osim_3_ranges = GRanges(
seqnames = rep("ArtificialChromosome", length(peaks_osim_3_regions$peakStart)),
ranges = IRanges(peaks_osim_3_regions$peakStart, peaks_osim_3_regions$peakEnd)
)
overlapping_genes_3_pos = findOverlaps(gene_ranges_pos, peak_osim_3_ranges)
overlapping_indexes_3_pos = subjectHits(overlapping_genes_3_pos)
overlapping_gene_names_osim_3_pos = location_pos_genes$SYMBOL[overlapping_indexes_3_pos]
overlapping_gene_names_osim_3_pos=as.data.frame(overlapping_gene_names_osim_3_pos)
dim(overlapping_gene_names_osim_3_pos)
## [1] 60324 1
overlapping_gene_names_osim_3_pos= subset(overlapping_gene_names_osim_3_pos, overlapping_gene_names_osim_3_pos !="NA")
dim(overlapping_gene_names_osim_3_pos)
## [1] 1716 1
head(overlapping_gene_names_osim_3_pos)
## overlapping_gene_names_osim_3_pos
## 1 CPQ
## 2 CPS1
## 3 CPZ
## 4 CRACR2A
## 5 CREB5
## 6 CREBRF
gene_ranges_neg= GRanges(
seqnames = rep("ArtificialChromosome", length(location_neg_genes$start_position)),
ranges = IRanges(location_neg_genes$start_position, location_neg_genes$end_position)
)
peak_osim_3_ranges = GRanges(
seqnames = rep("ArtificialChromosome", length(peaks_osim_3_regions$peakStart)),
ranges = IRanges(peaks_osim_3_regions$peakStart, peaks_osim_3_regions$peakEnd)
)
overlapping_genes_3_neg = findOverlaps(gene_ranges_neg, peak_osim_3_ranges)
overlapping_indexes_3_neg = subjectHits(overlapping_genes_3_neg)
overlapping_gene_names_osim_3_neg = location_neg_genes$SYMBOL[overlapping_indexes_3_neg]
overlapping_gene_names_osim_3_neg=as.data.frame(overlapping_gene_names_osim_3_neg)
dim(overlapping_gene_names_osim_3_neg)
## [1] 35586 1
overlapping_gene_names_osim_3_neg= subset(overlapping_gene_names_osim_3_neg, overlapping_gene_names_osim_3_neg !="NA")
dim(overlapping_gene_names_osim_3_neg)
## [1] 715 1
head(overlapping_gene_names_osim_3_neg)
## overlapping_gene_names_osim_3_neg
## 459 MMD
## 460 MMP1
## 461 MMP13
## 462 MMP9
## 463 MMS22L
## 464 MND1
overlap_positive_3= 1716
overlap_negative_3 = 715
total_genes_DE = 2778
prop_overlap_pos_3 = overlap_positive_3 / total_genes_DE
prop_overlap_neg_3 = overlap_negative_3 / total_genes_DE
proportions_3 = data.frame(
Category = c("Positive_DE_genes", "Negative_DE_genes"),
Proportion = c(prop_overlap_pos_3, prop_overlap_neg_3)
)
png("barplot3.png", width = 3000, height = 2000, res = 400)
barplot(proportions_3$Proportion, names.arg = proportions_3$Category,
xlab = "Category", ylab = "Proportion", main = "Overlapping with ChIP peaks Osim 3",
col = c("purple", "red"), ylim = c(0, max(proportions_3$Proportion) * 1.2))
dev.off()
## quartz_off_screen
## 2
Also in this last sample the
number of overlaps between the positively regulated genes and the peaks
is higher (1699) than the number of overlaps between the negatively DE
genes and the peaks (647).
Overall, in all three samples we can find comparable results.
As already mentioned, Third-generation EGFR tyrosine kinase inhibitors (EGFR-TKIs), such as osimertinib (irreversible EGFR-TKI) are vital in the treatment of non-small cell lung cancer with EGFR-TKI sensitizing or EGFR T790M resistance mutations. Although osimertinib provides clinical benefits to patients, disease progression and drug resistance are common occurrences. One proposed mechanism for the development of resistance involves the emergence of a drug tolerant persister (DTP) cell population, leading to de novo acquired resistance to osimertinib and other targeted cancer therapies. In this study, the scientists conducted RNA-seq analysis to examine the characteristics of osimertinib drug tolerant persisters (DTPs).
In my work I handled the data to answer some biological questions. First the data was normalized to allow the visualization of the data through PCA and then I performed a differential expression analysis in which I was able to identify the overexpressed genes and under expressed genes of cells treated with osimertinib compared to the controls. At this point I performed a gene ontology analysis to understand the biological meaning and potential functions associated with differentially expressed genes.
The epidermal growth factor receptor (EGFR) is a cell surface receptor that plays a crucial role in regulating various cellular processes, including cell division, cell migration, and cell survival.
It is interesting to notice that in the gene ontology analysis we do not find only processes specifically related to the development of a resistance of cells against Osimertinib or strictly related to the onset of a tumor. Indeed, Positively DE genes are mainly correlated to processes involved with kidneys like renal sodium ion transport, cell proliferation involved in kidney development and renal absorption. Other biological processes are involved in cell adhesion mechanisms and collagen related processes. These processes may be related to the development of a resistance against Osimertinib by DTPs. In fact, as already mentioned, EGFR plays a crucial role in regulating various cellular processes, including cell division, cell migration, and cell survival. Indeed, the developemnt of a resistance against osimertinib is related to the activation and overexpression of some of these processes that we can see on the GO graph. Furthermore, upregulated genes seem correlated also to positive regulation of smooth muscle cell migrates, biological processes involved with hormone processing and positive regulation of heart contraction together with regulation of systemic arterial blood pressure by hormones.
On the other hand, downregulated genes are mainly correlated to DNA involved processes like regulation, repair and replication, regulation of chromosome condensation, mitotic spindle processes and kinetochore organization/assembly. As expected, this downregulation of biological processes involved with DNA repair is in line with the onset of a malignant tumor and with the development of a resistance of DTPs, which is characterized by the repression of oncosuppressors. Indeed, oncosuppressors are a group of genes that play a crucial role in preventing the development and progression of cancer. These genes regulate cell growth, division, and death, and help maintain the stability of the genome. When functioning normally, oncosuppressors inhibit the formation of tumors by suppressing abnormal cell growth and promoting cellular repair mechanisms.
Finally, in the last part of the analysis I compared the obtained differentially expressed genes with a ChIP seq experiment. ChIP-seq has been used to investigate protein-DNA interactions and identify regions of H3K27ac. By performing ChIP-seq with an antibody specific for H3K27ac, researchers were able to map the genomic regions where this histone modification is present. This provides insights into the active regulatory elements and potential target genes associated with our specific condition. I performed an overlap between the upregulated DE genes and the regions of H3K27ac peaks and I did the same for the downregulated DE genes. As expected, there is a higher overlap between the positively regulated genes and H3K27ac peaks compared to the negatively regulated genes. This is because we know that a gene with a higher activity is characterized by active regulatory elements and so with a higher probability of H3K27 acetylation. On the other hand, downregulated genes are characterized by a reduced activity and so by a reduced acetylation of H3K27.
In conclusion, the RNA-seq analysis coupled with the ChIP-seq experiment allowed to unvelop some of the fundamental features of osimertinib drug tolerant persisters. The collected information could be exploited to further uncover new therapeutic opportunities for NSCLC EGFR mutant cells treatment.
• Association, A.L. Types of lung cancer, Lung Cancer Basics.
• Gridelli C, Rossi A, Carbone DP, Guarize J, Karachaliou N, Mok T, Petrella F, Spaggiari L, Rosell R. 2015. Non-small-cell lung cancer. Nature Reviews Disease Primers 1:15009. DOI: 10.1038/nrdp.2015.9.
• Hsu P-C, Jablons DM, Yang C-T, You L. 2019. Epidermal Growth Factor Receptor (EGFR) Pathway, Yes-Associated Protein (YAP) and the Regulation of Programmed Death-Ligand 1 (PD-L1) in Non-Small Cell Lung Cancer (NSCLC). International Journal of Molecular Sciences 20:3821. DOI: 10.3390/ijms20153821.
• Malapelle U, Ricciuti B, Baglivo S, Pepe F, Pisapia P, Anastasi P, Tazza M, Sidoni A, Liberati AM, Bellezza G, Chiari R, Metro G. 2018. Osimertinib. In: Martens UM ed. Small Molecules in Oncology. Recent Results in Cancer Research. Cham: Springer International Publishing, 257–276. DOI: 10.1007/978-3-319-91442-8_18.
• Prabhakar CN. 2015. Epidermal growth factor receptor in non-small cell lung cancer. Translational Lung Cancer Research 4:110–118. DOI: 10.3978/j.issn.2218-6751.2015.01.01.